Overview

Let’s load the Prosper data and look at the number of rows and columns.

Row count:

## [1] 113937

Column count:

## [1] 81

There are 113937 listings in the dataset, with 81 variables. For the scope of this project, I am going to limit the number of variables. The question is which ones.
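The two counts above can be obtained with `nrow()` and `ncol()`. This is a minimal sketch: the file name in the commented-out line is an assumption (the report does not show the path), so a tiny made-up data frame stands in for the real data.

```r
# Hypothetical file name; the report does not show the actual path.
# loans <- read.csv("prosperLoanData.csv")

# Tiny stand-in so the sketch runs on its own (values are illustrative).
loans <- data.frame(
  LoanOriginalAmount = c(9425L, 10000L, 3001L),
  BorrowerRate       = c(0.158, 0.092, 0.275),
  Term               = c(36L, 36L, 36L)
)

n_rows <- nrow(loans)  # number of listings
n_cols <- ncol(loans)  # number of variables
print(c(rows = n_rows, columns = n_cols))
```

On the real data these two calls would return the 113937 and 81 shown above.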

Looking at how Prosper works[1], I selected variables that fit the following criteria:

  1. Basic information: what a user gives to the site (loan amount, loan category, etc.) when they apply for a loan.
  2. Credit profile: information that may aid in generating the ‘Prosper rating’, ‘Borrower rate’ and ‘Term’. This can be seen on the loan listing page on the Prosper site.
  3. Other information: ‘Prosper Rating’, ‘Borrower rate’, ‘Term’ and ‘ListingCreationDate’.

## 'data.frame':    113937 obs. of  15 variables:
##  $ DelinquenciesLast7Years : int  4 0 0 14 0 0 0 0 0 0 ...
##  $ PublicRecordsLast10Years: int  0 1 0 0 0 0 0 1 0 0 ...
##  $ DebtToIncomeRatio       : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ BankcardUtilization     : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ RevolvingCreditBalance  : num  0 3989 NA 1444 6193 ...
##  $ DaysWithCreditLine      : num  5126 7159 4837 11926 4264 ...
##  $ InquiriesLast6Months    : int  3 3 0 0 1 0 0 3 1 1 ...
##  $ LoanOriginalAmount      : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ ListingCategory         : Factor w/ 21 levels "Not available",..: 1 3 1 17 3 2 2 3 8 8 ...
##  $ EmploymentStatus        : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ AnnualIncome            : num  37000 73500 25000 34500 115000 ...
##  $ BorrowerRate            : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ Term                    : Factor w/ 3 levels "12","36","60": 2 2 2 2 2 3 2 2 2 2 ...
##  $ ProsperRating           : Factor w/ 7 levels "AA","A","B","C",..: NA 2 NA 2 5 3 6 4 1 1 ...
##  $ ListingCreationDate     : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
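A structure like the one above could be produced by keeping only the chosen columns and converting the categorical ones to factors. The sketch below uses a made-up three-row data frame and a reduced column set. The rating levels after "C" are an assumption, since the output above truncates them.

```r
# Stand-in data; the real data frame has 113937 rows and 81 columns.
loans <- data.frame(
  LoanOriginalAmount = c(9425L, 10000L, 15000L),
  BorrowerRate       = c(0.158, 0.092, 0.2085),
  Term               = c(36L, 36L, 36L),
  ProsperRating      = c(NA, "A", "D"),
  Unused             = c("x", "y", "z"),  # a column we do not keep
  stringsAsFactors   = FALSE
)

# Keep only the variables of interest.
keep <- c("LoanOriginalAmount", "BorrowerRate", "Term", "ProsperRating")
selected <- loans[, keep]

# Term and ProsperRating are categorical, so store them as factors.
selected$Term <- factor(selected$Term, levels = c(12, 36, 60))
rating_levels <- c("AA", "A", "B", "C", "D", "E", "HR")  # assumed order
selected$ProsperRating <- factor(selected$ProsperRating, levels = rating_levels)

str(selected)
```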

Let’s take a look at the data summary:

##  DelinquenciesLast7Years PublicRecordsLast10Years DebtToIncomeRatio
##  Min.   : 0.000          Min.   : 0.0000          Min.   : 0.000   
##  1st Qu.: 0.000          1st Qu.: 0.0000          1st Qu.: 0.140   
##  Median : 0.000          Median : 0.0000          Median : 0.220   
##  Mean   : 4.155          Mean   : 0.3126          Mean   : 0.276   
##  3rd Qu.: 3.000          3rd Qu.: 0.0000          3rd Qu.: 0.320   
##  Max.   :99.000          Max.   :38.0000          Max.   :10.010   
##  NA's   :990             NA's   :697              NA's   :8554     
##  BankcardUtilization RevolvingCreditBalance DaysWithCreditLine
##  Min.   :0.000       Min.   :      0        Min.   : 1036     
##  1st Qu.:0.310       1st Qu.:   3121        1st Qu.: 5702     
##  Median :0.600       Median :   8549        Median : 7297     
##  Mean   :0.561       Mean   :  17599        Mean   : 7646     
##  3rd Qu.:0.840       3rd Qu.:  19521        3rd Qu.: 9276     
##  Max.   :5.950       Max.   :1435667        Max.   :24898     
##  NA's   :7604        NA's   :7604           NA's   :697       
##  InquiriesLast6Months LoanOriginalAmount           ListingCategory 
##  Min.   :  0.000      Min.   : 1000      Debt consolidation:58308  
##  1st Qu.:  0.000      1st Qu.: 4000      Not available     :16965  
##  Median :  1.000      Median : 6500      Other             :10494  
##  Mean   :  1.435      Mean   : 8337      Home improvement  : 7433  
##  3rd Qu.:  2.000      3rd Qu.:12000      Business          : 7189  
##  Max.   :105.000      Max.   :35000      Auto              : 2572  
##  NA's   :697                             (Other)           :10976  
##       EmploymentStatus  AnnualIncome       BorrowerRate    Term      
##  Employed     :67322   Min.   :       0   Min.   :0.0000   12: 1614  
##  Full-time    :26355   1st Qu.:   38404   1st Qu.:0.1340   36:87778  
##  Self-employed: 6134   Median :   56000   Median :0.1840   60:24545  
##  Not available: 5347   Mean   :   67296   Mean   :0.1928             
##  Other        : 3806   3rd Qu.:   81900   3rd Qu.:0.2500             
##               : 2255   Max.   :21000035   Max.   :0.4975             
##  (Other)      : 2718                                                 
##  ProsperRating                      ListingCreationDate
##  C      :18345   2013-10-02 17:20:16.550000000:     6  
##  B      :15581   2013-08-28 20:31:41.107000000:     4  
##  A      :14551   2013-09-08 09:27:44.853000000:     4  
##  D      :14274   2013-12-06 05:43:13.830000000:     4  
##  E      : 9795   2013-12-06 11:44:58.283000000:     4  
##  (Other):12307   2013-08-21 07:25:22.360000000:     3  
##  NA's   :29084   (Other)                      :113912

Univariate Plots Section

Basic information

Loan amount

There are several sharp spikes in the loan amount, which is no surprise: people tend to borrow round numbers. It is interesting to note that 4000 is the most common amount borrowed, followed by 10000 and 15000.
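The most common amounts can also be read from a frequency table rather than from the histogram. A minimal sketch on made-up amounts follows. The 500 step used to define a "round" amount is an assumption for illustration.

```r
# Made-up loan amounts standing in for LoanOriginalAmount.
amounts <- c(4000, 4000, 4000, 10000, 10000, 15000, 3001, 9425)

# Count each distinct amount and sort from most to least frequent.
counts <- sort(table(amounts), decreasing = TRUE)
top2 <- names(counts)[1:2]
print(head(counts, 3))

# Share of loans at a multiple of 500 (assumed definition of "round").
round_share <- mean(amounts %% 500 == 0)
print(round_share)
```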

Loan category

Most people borrow to consolidate their debts.

Employment Status

Most borrowers are employed.

Annual Income

At binwidth=1000, we can see sharp spikes around some amounts, which makes sense, since users tend to enter round numbers. The histogram is skewed to the right (positively skewed): most incomes sit at the low end, with a long tail toward high incomes.
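One quick way to check the direction of the skew is to compare mean and median. In the summary above, the mean annual income (67296) exceeds the median (56000), which points to a long tail toward high incomes. The sketch below does the same check on made-up incomes and bins them at a width of 1000 without drawing the plot.

```r
# Made-up incomes standing in for AnnualIncome.
income <- c(25000, 34500, 37000, 56000, 73500, 115000, 250000)

# A mean above the median suggests a long right tail.
right_skewed <- mean(income) > median(income)
print(right_skewed)

# Bin at width 1000, as in the histogram; plot = FALSE returns the counts only.
h <- hist(income, breaks = seq(0, 250000, by = 1000), plot = FALSE)
n_bins <- length(h$counts)
```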

Credit profile

Payment history

Most borrowers have no delinquencies in the last 7 years and no public records in the last 10 years. If I remove the borrowers with 0 delinquencies and 0 public records, I get:

While most borrowers have 0 delinquencies, there are still almost 4000 borrowers who have at least 1 delinquency in the last 7 years, and 2000 borrowers who have at least 1 public record in the last 10 years.

Debt burden

Debt to Income Ratio

The debt-to-income ratio is the percentage of a consumer’s monthly gross income that goes toward paying debts. The data is capped at 10.01: any debt-to-income ratio larger than 1000% is reported as 1001%.
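As a worked illustration of the definition and the cap, here is a small sketch. The debt and income figures are made up, and the capping line only mirrors the rule described above. It is not Prosper's actual implementation.

```r
# Made-up monthly figures for three hypothetical borrowers.
monthly_debt   <- c(500, 1200, 60000)
monthly_income <- c(3000, 4000, 5000)

# Debt-to-income ratio as a fraction of gross monthly income.
dti <- monthly_debt / monthly_income

# Ratios above 10 (1000%) are reported as 10.01 (1001%).
dti_capped <- ifelse(dti > 10, 10.01, dti)
print(round(dti_capped, 4))
```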

Removing the upper quantile of the data, we get:
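A sketch of how such a trim could be done with `quantile()`. The report does not say which quantile was removed, so the 95% cutoff here is an assumption, and the values are made up.

```r
# Made-up ratios, including one capped value and one missing value.
dti <- c(0.17, 0.18, 0.06, 0.15, 0.26, 0.36, 0.27, 0.24, 0.25, 10.01, NA)

# Assumed cutoff: drop everything above the 95th percentile.
cutoff <- quantile(dti, probs = 0.95, na.rm = TRUE)

# Keep non-missing values at or below the cutoff.
trimmed <- dti[!is.na(dti) & dti <= cutoff]
print(summary(trimmed))
```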

Revolving Credit Balance

Revolving Credit Balance is the total outstanding balance that the borrower owes on open credit cards or other revolving credit accounts.

Bankcard Utilization

Bankcard utilization is the sum of the balances owed on open bankcards divided by the sum of the card’s credit limits. Lower usually means better.

Interestingly, there are 2 peaks in the plot: a lot of borrowers have almost 0% bankcard utilization, and there is another peak near 100%. Some borrowers have utilization > 1.00 (100%).
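The definition of bankcard utilization can be made concrete with a small calculation. The balances and limits below are made up. The second borrower shows how a value above 1.00 can arise when balances exceed the combined limit.

```r
# Hypothetical helper: total balance divided by total credit limit.
bankcard_utilization <- function(balances, limits) {
  sum(balances) / sum(limits)
}

# Borrower with three open cards, well under the limits.
u_low <- bankcard_utilization(c(1200, 300, 0), c(2000, 1000, 5000))

# Borrower over the combined limit, giving utilization above 1.00.
u_over <- bankcard_utilization(c(3500, 2000), c(3000, 2000))

print(c(u_low, u_over))
```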

Length of credit history

Length of credit history is the number of days from the date when the oldest account on the borrower’s credit record was opened until today.

The longest credit lines go back more than 60 years (the maximum of 24898 days is roughly 68 years).
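A sketch of how a days-with-credit-line variable could be derived. It assumes the opening date comes from a column such as `FirstRecordedCreditLine` and uses a fixed, hypothetical reference date instead of today so the result is reproducible. The report shows neither detail.

```r
# Made-up opening dates standing in for the assumed FirstRecordedCreditLine column.
first_credit_line <- as.Date(c("2001-10-11", "1996-03-18"))

# Hypothetical reference date; the report uses the day the analysis was run.
reference_date <- as.Date("2015-01-01")

# Subtracting Date objects gives a difference in days.
days_with_credit_line <- as.numeric(reference_date - first_credit_line)
print(days_with_credit_line)

# Converting the maximum from the summary (24898 days) to years.
max_years <- 24898 / 365.25
print(round(max_years, 1))
```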

Other information

Most loans have a 36-month term.

The median borrower rate is 18.4% and the mean is 19.28%. There are 6 observations with a borrower rate of more than 40%.

Univariate Analysis

What is/are the main feature(s) of interest in your dataset?

The main features of the data are:

I chose these variables because they are visible in the UI[1].

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I added ListingCreationDate, just to see if there is a “trend” in the behavior.

Did you create any new variables from existing variables in the dataset?

Yes, Days with credit line.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

All of the money-related variables (LoanOriginalAmount, RevolvingCreditBalance and AnnualIncome) are positively skewed. I did not transform the data for the univariate analysis.


Bivariate Plots Section

People who borrow > 25000 have an annual income of >= 100000. It looks like there is some kind of rule: if you borrow > 25000, the minimum annual income is 100000.

That is not too informative. Let’s try to break the DebtToIncomeRatio into several bins.

A quick look at the newly created variable.
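Binning a continuous variable like this is typically done with `cut()`. The sketch below is illustrative only: the break points are an assumption (the report does not list the bins it used) and the ratios are made up.

```r
# Made-up ratios, including the capped value and a missing value.
dti <- c(0.06, 0.17, 0.26, 0.36, 0.75, 10.01, NA)

# Assumed break points; the report's actual bins are not shown.
breaks <- c(0, 0.1, 0.2, 0.3, 0.4, 1, Inf)

# include.lowest = TRUE keeps a ratio of exactly 0 in the first bin.
dti_bin <- cut(dti, breaks = breaks, include.lowest = TRUE)

# Quick look at the new variable, including missing values.
print(table(dti_bin, useNA = "ifany"))
```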

Let’s have another look at the relationship between DebtToIncomeRatio with BorrowerRate.

We can see that the median BorrowerRate increases as the DebtToIncomeRatio increases.

Let’s separate the borrower rate into bins as well.
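The per-bin median that a box plot displays can also be computed directly with `tapply()`. A minimal sketch on made-up data with two hypothetical bins:

```r
# Made-up data: a two-level bin variable and a borrower rate.
loans <- data.frame(
  dti_bin      = factor(c("low", "low", "high", "high", "high"),
                        levels = c("low", "high")),
  BorrowerRate = c(0.10, 0.14, 0.20, 0.30, 0.25)
)

# Median borrower rate within each debt-to-income bin.
median_rate <- tapply(loans$BorrowerRate, loans$dti_bin, median)
print(median_rate)
```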

Let’s take a look at the relationship between Term and other variables.

Let’s take a look at the relationship between DelinquenciesLast7Years and PublicRecordsLast10Years.

Let’s check ProsperRating relationship with other variables.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I wanted to see how several features affect the borrower rate, term and Prosper rating. I put several variables into “bins”, as this makes them a bit easier to work with. By doing this for borrower rate, debt to income ratio and delinquencies, we can paint a clearer picture of the relationships between features.

We can see, for instance, that the borrower rate increases as the debt to income ratio increases. The term seems to be related to the loan original amount: the bigger the amount, the longer the term.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

No.

What was the strongest relationship you found?

If we plot the ProsperRating against other features, the plots become much clearer. For instance, we can quickly see that the debt to income ratio for rating AA is lower than for other ratings.

The borrower rate shows an even clearer picture: the better your rating, the lower your borrower rate.


Multivariate Plots Section

There are some differences in the feature distributions within a rating from year to year. For instance, in the borrower rate distribution we can see that the borrower rate for rating AA in 2013 and 2014 is almost entirely between 0-10%. Rating B seems to have a borrower rate of 20-30% in 2011 and 2012, but 10-20% in 2013 and 2014.
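A year-by-rate-bin breakdown within one rating can be sketched as follows. The data is made up, the 10-point rate bins follow the ranges quoted in the text, and the rating levels after "C" are an assumption.

```r
# Made-up listings; ListingCreationDate is stored as a factor in the data.
created <- factor(c("2009-07-20 10:15:00.000000000",
                    "2013-10-02 17:20:16.550000000",
                    "2014-01-15 08:00:00.000000000"))
rating <- factor(c("AA", "AA", "B"),
                 levels = c("AA", "A", "B", "C", "D", "E", "HR"))
rate <- c(0.12, 0.07, 0.15)

# The year is the first four characters of the timestamp.
year <- as.integer(substr(as.character(created), 1, 4))

# Borrower rate in 10-point bins: 0-10%, 10-20%, and so on.
rate_bin <- cut(rate, breaks = seq(0, 0.5, by = 0.1), include.lowest = TRUE)

# Year by rate-bin counts for rating AA only.
is_aa <- rating == "AA"
aa_table <- table(year[is_aa], rate_bin[is_aa])
print(aa_table)
```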


Final Plots and Summary

Plot One

Description One

This plot shows the effect of several factors on ProsperRating. For instance, a borrower with lower bankcard utilization will usually get a better rating. Likewise, the longer you have had a credit line (DaysWithCreditLine), the better.

Plot Two

Description Two

This plot shows the borrower rate distribution for borrowers based on ProsperRating. A borrower rated AA will likely have a 0-10% borrower rate.

Plot Three

Description Three

This plot is another look at plot 2, with the added dimension of listing creation date. It shows the trend of the borrower rate from 2009 to 2014, faceted by ProsperRating. We can see that a borrower rated AA in 2009 could get a 10-20% borrower rate, while in 2013 and 2014 a borrower rated AA would get a 0-10% rate. Most borrowers rated E in 2009 got a 30-40% rate, but in 2014 they could get a 20-30% rate.


Reflection

The Prosper data has a lot of variables, so for the scope of this project I limited the number of variables to investigate. The first step was to select which variables to investigate. After much thought, I used the variables that a borrower can actually see on the loan listing page[1]. I did this because I assume these are the metrics that are important for a lender to look at before actually lending money, so they are a good start.

Initially I wanted to show the relationship between these variables and the borrower rate, for instance debt to income ratio vs borrower rate, or bankcard utilization vs borrower rate. To ease the exploration I put several variables into “bins”, which made it easier to show the relationships between variables.

It is also much easier to show relationships based on ProsperRating than on borrower rate. For instance, if we facet debt to income ratio by ProsperRating, it is easier to see that the lower your debt to income ratio, the better your rating. We can then show that the better your rating, the better your borrower rate.

Even with this limited number of variables, there are a lot of things that we can investigate further.

References

[1] https://www.prosper.com/help/topics/how-to-read-a-loan-listing/